Automating Feature Set Selection for Case-Based Learning of Linguistic Knowledge
نویسنده
چکیده
This paper addresses the issue of "algorithm vs. representation" for case-based learning of linguistic knowledge. We first present empirical evidence that the success of case-based learning methods for natural language processing tasks depends to a large degree on the feature set used to describe the training instances. Next, we present a technique for automating feature set selection for case-based learning of linguistic knowledge. Given as input a baseline case representation, the method modifies the representation in response to a number of predefined linguistic biases by adding, deleting, and weighting features appropriately. We apply the linguistic bias approach to feature set selection to the problem of relative pronoun disambiguation and show that the casebased learning algorithm improves as relevant biasses are incorporated into the underlying instance representation. Finally, we argue that the linguistic bias approach to feature set selection offers new possibilities for case-based learning of natural language: it simplifies the process of instance representation design and, in theory, obviates the need for separate instance representations for each linguistic knowledge acquisition task. More importantly, the approach offers a mechanism for explicitly combining the frequency information available from corpus-based techniques with linguistic bias information employed in traditional linguistic and knowledge-based approaches to natural language processing.
منابع مشابه
Integrating case-based learning and cognitive biases for machine learning of natural language
This paper shows that psychological constraints on human information processing can be used effectively to guide feature set selection for case-based learning of linguistic knowledge. Given as input a baseline case representation for a natural language learning task, our algorithm selects the relevant cognitive biases for the task and then automatically modifies the representation in response t...
متن کاملA Random Forest Classifier based on Genetic Algorithm for Cardiovascular Diseases Diagnosis (RESEARCH NOTE)
Machine learning-based classification techniques provide support for the decision making process in the field of healthcare, especially in disease diagnosis, prognosis and screening. Healthcare datasets are voluminous in nature and their high dimensionality problem comprises in terms of slower learning rate and higher computational cost. Feature selection is expected to deal with the high dimen...
متن کاملOnline Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features
Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...
متن کاملFeature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach
Feature selection can significantly be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not have decent performance in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...
متن کاملA hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts
High dimensional microarray datasets are difficult to classify since they have many features with small number ofinstances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improvethe classification performance of microarray datasets by selecting the significant features. Combining the concepts ofrough sets, weighted rough set, fuzzy rough se...
متن کامل